This notebook aims to show how we could approach this competition in a realistic setting. This means in a setting where we genuinely do not have the test set and we have to deliver a model that performs reasonably well. Moreover, we want to be able to explain where our model shines and where it is not as trustworthy.
The key concepts for a realistic end-to-end project that we will explore are:
The steps we will follow are:
Note: this notebook is still a work in progress, in bold the sections that are already there
Note: from version 23, I needed a newer version of scikit-learn, please indulge me in the occasional execution error and remember to turn the internet on if you fork this notebook
!pip install scikit-learn==0.22
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import StratifiedShuffleSplit, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.inspection import plot_partial_dependence
from sklearn.linear_model import Lasso, Ridge, SGDRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
import xgboost as xgb
import lightgbm as lgb
import df_pipeline as dfp # all the custom pipeline parts
import explore_data as exp # all function to quickly explore the data
import warnings
pd.set_option('max_columns', 200)
The goal here is only to get an idea about the structure, the data types, the missing values. We should try to not take any decisions just yet because we did not set up any validation strategy yet.
Note also that we do not load the test set, for the purposes of this notebook the test set is needed only for the final submission. Imagine the test set as the one that your client sends you a month after you delivered your model.
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_train.head()
Very well, we have a good combination of numerical, ordinal, and categorical features (which is also why this is a good competition to start with). We also have quite a few missing values (most of which are explained in the data description).
col_mis = exp.list_missing(df_train)
And, by looking at the distribution of the data, we have some odd values
df_train.hist(bins=50, figsize=(20,15))
plt.show()
All the information I want to get from these plots are about odd distributions, not only I see there is a skew in some continuous variables, but also that some features have mostly one value.
One way to remove the skeweness is to simply take the logarithm of the variable. This is what we are going to do now with the target variable
df_train['target'] = np.log1p(df_train.SalePrice)
del df_train['SalePrice']
We will implement a cleaning procedure that follows the documentation shortly but, before even creating the evaluation environment, we want to remove 2 outliers that the documentation recommends to remove. See this step as following the instructions that came with the data.
df_train = df_train[df_train.GrLivArea < 4500].reset_index(drop=True)
Now, since we want to evaluate our model in order to be able to say how good or bad it can be in certain situations, we need to create our test set (not the test set provided by kaggle, that one will arrive in a month after we are done). Every insight and every decision will come from something we will do on the train set, leaving the evaluation of our choices as pure as possible.
If we were doing the data exploration phase before this step, we would have used information coming from both sets to take decisions, a luxury that in a realistic situation we won't have.
How to split the data? Giving the size of the training set, it makes sense to use a 80-20 split. A random split will do the job just fine but we can also use some knowledge of the problem. Since all that matters for a house is location, location, location , we can make the split in order to correctly represent the distribution of the houses across the various Neighborhoods (i.e. we can stratify the split).
def make_test(train, test_size, random_state, strat_feat=None):
if strat_feat:
split = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
for train_index, test_index in split.split(train, train[strat_feat]):
train_set = train.loc[train_index]
test_set = train.loc[test_index]
return train_set, test_set
train_set, test_set = make_test(df_train,
test_size=0.2, random_state=654,
strat_feat='Neighborhood')
And we see that the proportion of houses per Neighborhood is approximatively preserved
train_set.Neighborhood.value_counts(normalize=True)
df_train.Neighborhood.value_counts(normalize=True)
Before getting insights from the data, let's take the final step of the instructions that came with the data (i.e. the data description) and have a general cleaning
class general_cleaner(BaseEstimator, TransformerMixin):
'''
This class applies what we know from the documetation.
It cleans some known missing values
If flags the missing values
This process is supposed to happen as first step of any pipeline
'''
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#LotFrontage
X.loc[X.LotFrontage.isnull(), 'LotFrontage'] = 0
#Alley
X.loc[X.Alley.isnull(), 'Alley'] = "NoAlley"
#MSSubClass
X['MSSubClass'] = X['MSSubClass'].astype(str)
#MissingBasement
fil = ((X.BsmtQual.isnull()) & (X.BsmtCond.isnull()) & (X.BsmtExposure.isnull()) &
(X.BsmtFinType1.isnull()) & (X.BsmtFinType2.isnull()))
fil1 = ((X.BsmtQual.notnull()) | (X.BsmtCond.notnull()) | (X.BsmtExposure.notnull()) |
(X.BsmtFinType1.notnull()) | (X.BsmtFinType2.notnull()))
X.loc[fil1, 'MisBsm'] = 0
X.loc[fil, 'MisBsm'] = 1 # made explicit for safety
#BsmtQual
X.loc[fil, 'BsmtQual'] = "NoBsmt" #missing basement
#BsmtCond
X.loc[fil, 'BsmtCond'] = "NoBsmt" #missing basement
#BsmtExposure
X.loc[fil, 'BsmtExposure'] = "NoBsmt" #missing basement
#BsmtFinType1
X.loc[fil, 'BsmtFinType1'] = "NoBsmt" #missing basement
#BsmtFinType2
X.loc[fil, 'BsmtFinType2'] = "NoBsmt" #missing basement
#BsmtFinSF1
X.loc[fil, 'BsmtFinSF1'] = 0 # No bsmt
#BsmtFinSF2
X.loc[fil, 'BsmtFinSF2'] = 0 # No bsmt
#BsmtUnfSF
X.loc[fil, 'BsmtUnfSF'] = 0 # No bsmt
#TotalBsmtSF
X.loc[fil, 'TotalBsmtSF'] = 0 # No bsmt
#BsmtFullBath
X.loc[fil, 'BsmtFullBath'] = 0 # No bsmt
#BsmtHalfBath
X.loc[fil, 'BsmtHalfBath'] = 0 # No bsmt
#FireplaceQu
X.loc[(X.Fireplaces == 0) & (X.FireplaceQu.isnull()), 'FireplaceQu'] = "NoFire" #missing
#MisGarage
fil = ((X.GarageYrBlt.isnull()) & (X.GarageType.isnull()) & (X.GarageFinish.isnull()) &
(X.GarageQual.isnull()) & (X.GarageCond.isnull()))
fil1 = ((X.GarageYrBlt.notnull()) | (X.GarageType.notnull()) | (X.GarageFinish.notnull()) |
(X.GarageQual.notnull()) | (X.GarageCond.notnull()))
X.loc[fil1, 'MisGarage'] = 0
X.loc[fil, 'MisGarage'] = 1
#GarageYrBlt
X.loc[X.GarageYrBlt > 2200, 'GarageYrBlt'] = 2007 #correct mistake
X.loc[fil, 'GarageYrBlt'] = X['YearBuilt'] # if no garage, use the age of the building
#GarageType
X.loc[fil, 'GarageType'] = "NoGrg" #missing garage
#GarageFinish
X.loc[fil, 'GarageFinish'] = "NoGrg" #missing
#GarageQual
X.loc[fil, 'GarageQual'] = "NoGrg" #missing
#GarageCond
X.loc[fil, 'GarageCond'] = "NoGrg" #missing
#Fence
X.loc[X.Fence.isnull(), 'Fence'] = "NoFence" #missing fence
#Pool
fil = ((X.PoolArea == 0) & (X.PoolQC.isnull()))
X.loc[fil, 'PoolQC'] = 'NoPool'
del X['Id']
del X['MiscFeature'] # we already know it doesn't matter
return X
Why a class? Because we want to be able to reproduce every step on unseen data and a class makes easier to have a very clear set of steps to follow. More explanation on this and on the use of Pipelines (that will come later) can be found in this other notebook https://www.kaggle.com/lucabasa/understand-and-use-a-pipeline
To use this class and clean our data, we can simply do
train_cleaned = train_set.copy() # I want to work on train_set again later from scratch
train_cleaned = general_cleaner().fit_transform(train_cleaned)
mis_cols = exp.list_missing(train_cleaned)
We see that most of the missing values were actually very well explained by the documentation, we will deal with the remaining ones later on.
Please note that we don't know what is going to be missing in the future, thus, even if there were no missing values remaining in our data, we should implement procedures to clean the data anyway (or, in alternative, throw appropriate errors so that the team maintaing the model will know quickly what to do).
This phase is tricky because it is easy to go too deep into the rabbit hole. We want to have a comprehensive understanding of the data but at the same time we don't want to spend too much time on it. While in this dataset the number of features is fairly limited, in other cases a slow approach can cost us weeks of work and, at this stage, we don't know if this work is worth weeks of our time.
We will thus focus on the following:
Once again, please note how we don't use the full training set but only what we have created in the previous section.
To first identify what can be interesting, we can look the correlations with the target. This is not by any means enough or necessary to make a feature interesting but, in absence of other knowledge, it is a good first step.
high_corr = exp.plot_correlations(train_cleaned, 'target', limit=10, annot=True)
high_corr = exp.plot_correlations(train_cleaned, 'target', limit=20)
Nothing particularly shocking here:
Let's have a look at some distributions, starting with the target variable.
exp.plot_distribution(train_cleaned, 'target')
Yes, we removed the skewness with that logarithm transformation in the previous section.
Let's continue by looking at the numerical features most correlated with the target.
for col in high_corr[1:6].index:
exp.plot_distribution(train_cleaned, col, correlation=high_corr)
We see, for example, that the negative skew is also present in GrLivArea, a feature that we all expect playing a big role in determining the final price.
Let's try to get more insights with a bivariate analysis
exp.corr_target(train_cleaned, 'target', list(high_corr[1:12].index))
The discrete values of GarageCars and OverallQual invite in using some estimator to see if the patter is clear as it looks.
exp.corr_target(train_cleaned, 'target',
[col for col in high_corr.index if 'Qual' in col or 'Car' in col],
x_estimator=np.median)
Yes, it is very clear indeed. Let's move on to the categorical features.
In this dataset we don't have many features and, thanks to the function in the explore_data utility script, we can quickly shuffle through all of them to find the interesting one. If this was not possible, one approach would be to find the features whose cateogories exhibit a significantly different distribution of the target variable. In other words, for example, if the distribution of prices for houses withouth a Fence is significantly different than the ones with a Fence, the next function will catch it and tell us.
exp.find_cats(train_cleaned, 'target', thrs=0.3, critical=0.05)
We can start from these to see if we get some insight
exp.segm_target(train_cleaned, 'BsmtQual', 'target')
It seems there is a very clear relation between the quality of the basement and the sale price. Moreover, it invites in recoding BsmtQual into an ordinal feature.
The same holds for KitchenQual
exp.segm_target(train_cleaned, 'KitchenQual', 'target')
Or of ExterQual
exp.segm_target(train_cleaned, 'ExterQual', 'target')
exp.segm_target(train_cleaned, 'LotShape', 'target')
Here it looks it makes a difference only in having a regular vs irregular lot shape.
exp.segm_target(train_cleaned, 'MasVnrType', 'target')
The other features found above do not show anything in particular on this kind of plots, so we won't display them. We can have a quick look at other features not found by that function, for example
exp.segm_target(train_cleaned, 'CentralAir', 'target')
exp.segm_target(train_cleaned, 'GarageFinish', 'target')
Another way of looking at these features is to combine them in a plot together with another numerical feature. We know that GrLivArea is going to be important and our function told us that HouseStyle has interesting categories in it. We can combine them like this
exp.plot_bivariate(train_cleaned, 'GrLivArea', 'target', hue='HouseStyle', alpha=0.7)
Or use this plot to just investigate further the features we have analyzed before
exp.plot_bivariate(train_cleaned, 'GrLivArea', 'target', hue='ExterQual', alpha=0.7)
As mentioned before, it is easy to go too deep in the rabbit hole in this phase. I have been writing this section for an hour already and I feel confident enough about the data to run the first models.
This section has multiple goals, the main one is to bridge to the next section about feature engineering. We want to implement a system of evaluating, selecting, and tuning our models that is robust enough to the data processing phase. If we do this part correctly, we will be able to quickly iterate between the processing and modeling phase, being confident that we can correctly assess the validity of our actions.
As explained already in this notebook https://www.kaggle.com/lucabasa/understand-and-use-a-pipeline, we need different transformations for numeric and categorical features. We will use the insights gained in the previous section to make a few new custom transformers.
class tr_numeric(BaseEstimator, TransformerMixin):
def __init__(self, SF_room=True, bedroom=True, bath=True, lot=True, service=True):
self.columns = [] # useful to well behave with FeatureUnion
def fit(self, X, y=None):
return self
def remove_skew(self, X, column):
X[column] = np.log1p(X[column])
return X
def transform(self, X, y=None):
for col in ['GrLivArea', '1stFlrSF', 'LotArea']:
X = self.remove_skew(X, col)
self.columns = X.columns
return X
def get_feature_names(self):
return self.columns
class make_ordinal(BaseEstimator, TransformerMixin):
'''
Transforms ordinal features in order to have them as numeric (preserving the order)
If unsure about converting or not a feature (maybe making dummies is better), make use of
extra_cols and unsure_conversion
'''
def __init__(self, cols, extra_cols=None, include_extra=True):
self.cols = cols
self.extra_cols = extra_cols
self.mapping = {'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
self.include_extra = include_extra
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
if self.extra_cols:
if self.include_extra:
self.cols += self.extra_cols
else:
for col in self.extra_cols:
del X[col]
for col in self.cols:
X.loc[:, col] = X[col].map(self.mapping).fillna(0)
return X
class recode_cat(BaseEstimator, TransformerMixin):
'''
Recodes some categorical variables according to the insights gained from the
data exploration phase.
'''
def fit(self, X, y=None):
return self
def tr_GrgType(self, data):
data['GarageType'] = data['GarageType'].map({'Basment': 'Attchd',
'CarPort': 'Detchd',
'2Types': 'Attchd' }).fillna(data['GarageType'])
return data
def tr_LotShape(self, data):
fil = (data.LotShape != 'Reg')
data['LotShape'] = 1
data.loc[fil, 'LotShape'] = 0
return data
def tr_LandCont(self, data):
fil = (data.LandContour == 'HLS') | (data.LandContour == 'Low')
data['LandContour'] = 0
data.loc[fil, 'LandContour'] = 1
return data
def tr_MSZoning(self, data):
data['MSZoning'] = data['MSZoning'].map({'RH': 'RM', # medium and high density
'C (all)': 'RM', # commercial and medium density
'FV': 'RM'}).fillna(data['MSZoning'])
return data
def tr_Alley(self, data):
fil = (data.Alley != 'NoAlley')
data['Alley'] = 0
data.loc[fil, 'Alley'] = 1
return data
def tr_LotConfig(self, data):
data['LotConfig'] = data['LotConfig'].map({'FR3': 'Corner', # corners have 2 or 3 free sides
'FR2': 'Corner'}).fillna(data['LotConfig'])
return data
def tr_BldgType(self, data):
data['BldgType'] = data['BldgType'].map({'Twnhs' : 'TwnhsE',
'2fmCon': 'Duplex'}).fillna(data['BldgType'])
return data
def tr_MasVnrType(self, data):
data['MasVnrType'] = data['MasVnrType'].map({'BrkCmn': 'BrkFace'}).fillna(data['MasVnrType'])
return data
def tr_HouseStyle(self, data):
data['HouseStyle'] = data['HouseStyle'].map({'1.5Fin': '1.5Unf',
'2.5Fin': '2Story',
'2.5Unf': '2Story',
'SLvl': 'SFoyer'}).fillna(data['HouseStyle'])
return data
def transform(self, X, y=None):
X = self.tr_GrgType(X)
X = self.tr_LotShape(X)
X = self.tr_LotConfig(X)
X = self.tr_MSZoning(X)
X = self.tr_Alley(X)
X = self.tr_LandCont(X)
X = self.tr_BldgType(X)
X = self.tr_MasVnrType(X)
X = self.tr_HouseStyle(X)
return X
class drop_columns(BaseEstimator, TransformerMixin):
'''
Drops columns that are not useful for the model
'''
def __init__(self):
self.columns = []
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
to_drop = [col for col in X.columns if 'NoGrg' in col] # dropping dummies that are redundant
to_drop += [col for col in X.columns if 'NoBsmt' in col]
# other not useful columns
to_drop += [col for col in X.columns if 'MSSubClass' in col]
to_drop += [col for col in X.columns if 'Neighborhood' in col] # maybe useful in the future
to_drop += [col for col in X.columns if 'Condition1' in col]
to_drop += [col for col in X.columns if 'Condition2' in col]
to_drop += [col for col in X.columns if 'ExterCond' in col] # maybe make it ordinal
to_drop += [col for col in X.columns if 'Exterior1st' in col]
to_drop += [col for col in X.columns if 'Exterior2nd' in col]
to_drop += [col for col in X.columns if 'Functional' in col]
to_drop += [col for col in X.columns if 'Heating_' in col] # we don't want to drop the dummies of HeatingQC too
to_drop += [col for col in X.columns if 'PoolQC' in col]
to_drop += [col for col in X.columns if 'RoofMatl' in col]
to_drop += [col for col in X.columns if 'RoofStyle' in col]
to_drop += [col for col in X.columns if 'SaleCondition' in col]
to_drop += [col for col in X.columns if 'SaleType' in col]
to_drop += [col for col in X.columns if 'Utilities' in col]
to_drop += [col for col in X.columns if 'BsmtCond' in col] # maybe ordinal
to_drop += [col for col in X.columns if 'Electrical' in col]
to_drop += [col for col in X.columns if 'Foundation' in col]
to_drop += [col for col in X.columns if 'LandSlope' in col]
to_drop += [col for col in X.columns if 'Street' in col]
to_drop += [col for col in X.columns if 'Fence' in col]
to_drop += [col for col in X.columns if 'PavedDrive' in col]
for col in to_drop:
try:
del X[col]
except KeyError:
pass
self.columns = X.columns
return X
def get_feature_names(self):
return list(self.columns)
numeric_pipe = Pipeline([('fs', dfp.feat_sel('numeric')),
('imputer', dfp.df_imputer(strategy='median')),
('transf', tr_numeric())])
cat_pipe = Pipeline([('fs', dfp.feat_sel('category')),
('imputer', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual','GarageQual',
'GarageCond', 'ExterQual', 'HeatingQC'])),
('recode', recode_cat()),
('dummies', dfp.dummify())])
processing_pipe = dfp.FeatureUnion_df(transformer_list=[('cat_pipe', cat_pipe),
('num_pipe', numeric_pipe)])
full_pipe = Pipeline([('gen_cl', general_cleaner()),
('processing', processing_pipe),
('scaler', dfp.df_scaler()),
('dropper', drop_columns())])
tmp = train_set.copy()
full_pipe.fit_transform(tmp).head()
As you can see, we fill the missing values either with the median or with the mode, we remove the skew from some numerical features, we make some categorical feature ordinal, and we recode some categories accordingly to what we have already seen in the data exploration phase. By also using those insights, we drop some columns to make our model simpler. A few of these columns might be useful later on but for now we can safely drop them.
At the end of this pipeline, we will put a model. We know we need a regressor, but we still don't know what model is going to be good enough for us. We thus need to implement some methods to evaluate the models. At that point, we can simply try some models and focus on the most promising one.
There is no point now to play around with hyperparameters as we still have to try to improve the quality of our data before, which will improve our model far more than a fine tuned parameter
def cv_score(df_train, y_train, kfolds, pipeline, imp_coef=False):
oof = np.zeros(len(df_train))
train = df_train.copy()
feat_df = pd.DataFrame()
for n_fold, (train_index, test_index) in enumerate(kfolds.split(train.values)):
trn_data = train.iloc[train_index][:]
val_data = train.iloc[test_index][:]
trn_target = y_train.iloc[train_index].values.ravel()
val_target = y_train.iloc[test_index].values.ravel()
pipeline.fit(trn_data, trn_target)
oof[test_index] = pipeline.predict(val_data).ravel()
if imp_coef:
try:
fold_df = get_coef(pipeline)
except AttributeError:
fold_df = get_feature_importance(pipeline)
fold_df['fold'] = n_fold + 1
feat_df = pd.concat([feat_df, fold_df], axis=0)
if imp_coef:
feat_df = feat_df.groupby('feat')['score'].agg(['mean', 'std'])
feat_df['abs_sco'] = (abs(feat_df['mean']))
feat_df = feat_df.sort_values(by=['abs_sco'],ascending=False)
del feat_df['abs_sco']
return oof, feat_df
else:
return oof
def grid_search(data, target, estimator, param_grid, scoring, cv, random=False):
if random:
grid = RandomizedSearchCV(estimator=estimator, param_distributions=param_grid, cv=cv, scoring=scoring,
n_iter=random, n_jobs=-1, random_state=434, iid=False)
else:
grid = GridSearchCV(estimator=estimator, param_grid=param_grid,
cv=cv, scoring=scoring, n_jobs=-1, return_train_score=False)
pd.options.mode.chained_assignment = None # turn on and off a warning of pandas
tmp = data.copy()
grid = grid.fit(tmp, target)
pd.options.mode.chained_assignment = 'warn'
result = pd.DataFrame(grid.cv_results_).sort_values(by='mean_test_score',
ascending=False).reset_index()
del result['params']
times = [col for col in result.columns if col.endswith('_time')]
params = [col for col in result.columns if col.startswith('param_')]
result = result[params + ['mean_test_score', 'std_test_score'] + times]
return result, grid.best_params_, grid.best_estimator_
def get_coef(pipe):
imp = pipe.steps[-1][1].coef_.tolist()
feats = pipe.steps[-2][1].get_feature_names()
result = pd.DataFrame({'feat':feats,'score':imp})
result['abs_res'] = abs(result['score'])
result = result.sort_values(by=['abs_res'],ascending=False)
del result['abs_res']
return result
def get_feature_importance(pipe):
imp = pipe.steps[-1][1].feature_importances_.tolist() #it's a pipeline
feats = pipe.steps[-2][1].get_feature_names()
result = pd.DataFrame({'feat':feats,'score':imp})
result = result.sort_values(by=['score'],ascending=False)
return result
def _plot_diagonal(ax):
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
low = min(xmin, xmax)
high = max(xmin, xmax)
scl = (high - low) / 100
line = pd.DataFrame({'x': np.arange(low, high ,scl), # small hack for a diagonal line
'y': np.arange(low, high ,scl)})
ax.plot(line.x, line.y, color='black', linestyle='--')
return ax
def plot_predictions(data, true_label, pred_label, feature=None, hue=None, legend=False):
tmp = data.copy()
tmp['Prediction'] = pred_label
tmp['True Label'] = true_label
tmp['Residual'] = tmp['True Label'] - tmp['Prediction']
diag = False
alpha = 0.7
label = ''
fig, ax = plt.subplots(1,2, figsize=(15,6))
if feature is None:
feature = 'True Label'
diag = True
else:
legend = 'full'
sns.scatterplot(x=feature, y='True Label', data=tmp, ax=ax[0], label='True',
hue=hue, legend=legend, alpha=alpha)
label = 'Predicted'
alpha = 0.4
sns.scatterplot(x=feature, y='Prediction', data=tmp, ax=ax[0], label=label,
hue=hue, legend=legend, alpha=alpha)
if diag:
ax[0] = _plot_diagonal(ax[0])
sns.scatterplot(x=feature, y='Residual', data=tmp, ax=ax[1],
hue=hue, legend=legend, alpha=0.7)
ax[1].axhline(y=0, color='r', linestyle='--')
ax[0].set_title(f'{feature} vs Predictions')
ax[1].set_title(f'{feature} vs Residuals')
models = [('lasso', Lasso(alpha=0.01)), ('ridge', Ridge()), ('sgd', SGDRegressor()),
('forest', RandomForestRegressor(n_estimators=200)), ('xtree', ExtraTreesRegressor(n_estimators=200)),
('svr', SVR()),
('kneig', KNeighborsRegressor()),
('xgb', xgb.XGBRegressor(n_estimators=200, objective='reg:squarederror')),
('lgb', lgb.LGBMRegressor(n_estimators=200))]
mod_name = []
rmse_train = []
rmse_test = []
mae_train = []
mae_test = []
folds = KFold(5, shuffle=True, random_state=541)
y = train_set['target'].copy()
del train_set['target']
y_test = test_set['target']
del test_set['target']
warnings.filterwarnings("ignore",
message="The dummies in this set do not match the ones in the train set, we corrected the issue.")
for model in models:
train = train_set.copy()
test = test_set.copy()
print(model[0])
mod_name.append(model[0])
pipe = [('gen_cl', general_cleaner()),
('processing', processing_pipe),
('scl', dfp.df_scaler()),
('dropper', drop_columns())] + [model]
model_pipe = Pipeline(pipe)
inf_preds = cv_score(train, y, folds, model_pipe)
model_pipe.fit(train, y) # refit on full train set
preds = model_pipe.predict(test)
rmse_train.append(mean_squared_error(y, inf_preds))
rmse_test.append(mean_squared_error(y_test, preds))
mae_train.append(mean_absolute_error(np.expm1(y), np.expm1(inf_preds)))
mae_test.append(mean_absolute_error(np.expm1(y_test), np.expm1(preds)))
print(f'\tTrain set RMSE: {round(np.sqrt(mean_squared_error(y, inf_preds)), 4)}')
print(f'\tTrain set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(inf_preds)), 2)}')
print(f'\tTest set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'\tTest set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
print('_'*40)
print('\n')
results = pd.DataFrame({'model_name': mod_name,
'rmse_train': rmse_train, 'rmse_test': rmse_test,
'mae_train': mae_train, 'mae_test': mae_test})
results
We will continue with the top5: Lasso, RandomForest, XGBoost, LGBoost, Ridge. We will have these results as baseline, being them the one we obtained by simply looking at the data.
We can investigate each of these models by looking at their predictions and at how those predictions are made. For example
lasso_pipe = Pipeline([('gen_cl', general_cleaner()),
('processing', processing_pipe),
('scl', dfp.df_scaler()),
('dropper', drop_columns()),
('lasso', Lasso(alpha=0.01))])
lasso_oof, lasso_coef = cv_score(train_set, y, folds, lasso_pipe, imp_coef=True)
lasso_coef
plot_predictions(train_set, y, lasso_oof)
This residual plot is fairly problematic and we will address it later on.
We also notice that quite a few features do not matter for our predictions
lasso_coef[lasso_coef['mean']==0].sample(10)
We have already implemented a few transformations for our base models. For example, we have already removed the skew from some numerical features and we can see that the result is promising
for col in ['GrLivArea', '1stFlrSF', 'LotArea']:
train_cleaned[col] = np.log1p(train_cleaned[col])
exp.corr_target(train_cleaned, 'target', ['GrLivArea', '1stFlrSF', 'LotArea'])
Moreover, we have already transformed some categorical features into ordinal ones
train_cleaned = make_ordinal(['BsmtQual', 'KitchenQual',
'GarageQual','GarageCond',
'ExterQual', 'HeatingQC']).fit_transform(train_cleaned)
exp.corr_target(train_cleaned, 'target', ['BsmtQual', 'KitchenQual',
'GarageQual','GarageCond',
'ExterQual', 'HeatingQC'], x_estimator=np.mean)
We indeed observe a linear relation between the target and some of these features. For others it will be probably be better to use a different encoding.
At last, at least regarding what we have already implemented, we recoded a few categories following common sense. The goal here is to not train on too rare categories as this can lead to poor generalizability of the model. We report here only one of those transformations as an example
train_cleaned = recode_cat().fit_transform(train_cleaned)
exp.segm_target(train_cleaned, 'GarageType', 'target')
As we can see, the differences among different categories are more clear. This will definitely help the models.
We will now use some insights gained during the data exploration phase to make better features.
We noticed a good correlation between TotRmsAbvGrd, GrLivArea, and BedroomAbvGr, which is nothing shocking. Let's see what an interaction between the two looks like.
def SF_per_room(data):
data['sf_per_room'] = data['GrLivArea'] / data['TotRmsAbvGrd']
return data
def bedroom_prop(data):
data['bedroom_prop'] = data['BedroomAbvGr'] / data['TotRmsAbvGrd']
return data
train_cleaned = SF_per_room(train_cleaned)
train_cleaned = bedroom_prop(train_cleaned)
exp.corr_target(data=train_cleaned,
cols=['GrLivArea', 'TotRmsAbvGrd', 'BedroomAbvGr', 'sf_per_room', 'bedroom_prop'],
target='target')
It looks we found something: the bigger the house, the more it costs, but the bigger the rooms (on average), the less the house is expensive.
Next, we have a lot of bath features that individually are not saying too much. Let's try to make something out of them.
def total_bath(data):
data['total_bath'] = data[[col for col in data.columns if 'FullBath' in col]].sum(axis=1) \
+ 0.5 * data[[col for col in data.columns if 'HalfBath' in col]].sum(axis=1)
return data
train_cleaned = total_bath(train_cleaned)
exp.corr_target(data=train_cleaned,
cols=[col for col in train_cleaned if 'Bath' in col] + ['total_bath'],
target='target')
This is indeed a clearer signal and we should consider using this feature and dropping the other bath features.
We also have quite some features regarding external area (porch, Lot, Pool), but with the exception of LotArea, they are not very relevant. We thus can try to see if some interaction with the internal area is relevant.
def lot_prop(data):
data['lot_prop'] = data['LotArea'] / data['GrLivArea']
return data
train_cleaned = lot_prop(train_cleaned)
exp.corr_target(data=train_cleaned,
cols=['LotArea', 'GrLivArea', 'lot_prop'],
target='target')
Nothing game-changing here but, with the same spirity, we can see what role what we can call service areas (Basement and Garage) are playing.
def service_area(data):
data['service_area'] = data['TotalBsmtSF'] + data['GarageArea']
return data
train_cleaned = service_area(train_cleaned)
exp.corr_target(data=train_cleaned,
cols=['TotalBsmtSF', 'GarageArea', 'service_area'],
target='target')
Since we already have GarageCars to describe the Garage and this new feature is very correlated with the basement SF, we could consider if it is better to use it and drop the original two.
At last, for now, we can use a very powerful but dangerous encoding technique to make a better use of Neighborhood. We want to use target encoding and, as we will see, it will give us a very predictive feature. However, one must be cautious about using this kind of encoding as it easily lead to severe overfitting.
Two things can prevent the overfitting: the use of pipeline (so we are sure that the encoding happens only on the training set) and some smoothing techniques (so that there is some noise in the feature and the signal is not unrealistically clean). We will implement these features shortly but, for now, we explore the possibilities we have.
def tr_neighborhood(data, y=None):
# mean and standard deviation of the price per Neighborhood
means = data.groupby('Neighborhood')['target'].mean()
stds = data.groupby('Neighborhood')['target'].std()
data['Neig_target_mean'] = data['Neighborhood'].map(means)
data['Neig_target_std'] = data['Neighborhood'].map(stds)
# mean and standard deviation of the house size per Neighborhood
means = data.groupby('Neighborhood')['GrLivArea'].mean()
stds = data.groupby('Neighborhood')['GrLivArea'].std()
data['Neig_GrLivArea_mean'] = data['Neighborhood'].map(means)
data['Neig_GrLivArea_std'] = data['Neighborhood'].map(stds)
return data
train_cleaned = tr_neighborhood(train_cleaned)
exp.corr_target(train_cleaned, 'target', ['GrLivArea',
'Neig_target_mean','Neig_target_std',
'Neig_GrLivArea_mean', 'Neig_GrLivArea_std'
])
It looks like the target encoding with the mean is very powerful and not very correlated with the other features. We will use this one.
The same could be done with MSSubclass, being a very mysterious feature.
def tr_mssubclass(data, y=None):
# mean and standard deviation of the price per Neighborhood
means = data.groupby('MSSubClass')['target'].mean()
stds = data.groupby('MSSubClass')['target'].std()
data['MSSC_target_mean'] = data['MSSubClass'].map(means)
data['MSSC_target_std'] = data['MSSubClass'].map(stds)
# mean and standard deviation of the house size per Neighborhood
means = data.groupby('MSSubClass')['GrLivArea'].mean()
stds = data.groupby('MSSubClass')['GrLivArea'].std()
data['MSSC_GrLivArea_mean'] = data['MSSubClass'].map(means)
data['MSSC_GrLivArea_std'] = data['MSSubClass'].map(stds)
return data
train_cleaned = tr_mssubclass(train_cleaned)
exp.corr_target(train_cleaned, 'target', ['GrLivArea',
'MSSC_target_mean','MSSC_target_std',
'MSSC_GrLivArea_mean', 'MSSC_GrLivArea_std'
])
Not as evident as before but the mean target encoding looks promising. Being a very sparse feature, we will need to be careful with this encoding as there are a few very rare features and this, as mentioned before, will lead to overfitting (especially with some algorithms).
We have seen from the previous models already that some features do not look important at all. Not only they slow down the learning, but they can even make our models learn the wrong relationships. We thus now modify the pipeline in order to get rid of these features (it also makes the processing a bit quicker, computation-wise)
class general_cleaner(BaseEstimator, TransformerMixin):
'''
This class applies what we know from the documetation.
It cleans some known missing values
If flags the missing values
This process is supposed to happen as first step of any pipeline
'''
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
#LotFrontage
X.loc[X.LotFrontage.isnull(), 'LotFrontage'] = 0
#Alley
X.loc[X.Alley.isnull(), 'Alley'] = "NoAlley"
#MSSubClass
X['MSSubClass'] = X['MSSubClass'].astype(str)
#MissingBasement
fil = ((X.BsmtQual.isnull()) & (X.BsmtCond.isnull()) & (X.BsmtExposure.isnull()) &
(X.BsmtFinType1.isnull()) & (X.BsmtFinType2.isnull()))
fil1 = ((X.BsmtQual.notnull()) | (X.BsmtCond.notnull()) | (X.BsmtExposure.notnull()) |
(X.BsmtFinType1.notnull()) | (X.BsmtFinType2.notnull()))
X.loc[fil1, 'MisBsm'] = 0
X.loc[fil, 'MisBsm'] = 1 # made explicit for safety
#BsmtQual
X.loc[fil, 'BsmtQual'] = "NoBsmt" #missing basement
#BsmtCond
X.loc[fil, 'BsmtCond'] = "NoBsmt" #missing basement
#BsmtExposure
X.loc[fil, 'BsmtExposure'] = "NoBsmt" #missing basement
#BsmtFinType1
X.loc[fil, 'BsmtFinType1'] = "NoBsmt" #missing basement
#BsmtFinType2
X.loc[fil, 'BsmtFinType2'] = "NoBsmt" #missing basement
#BsmtFinSF1
X.loc[fil, 'BsmtFinSF1'] = 0 # No bsmt
#BsmtFinSF2
X.loc[fil, 'BsmtFinSF2'] = 0 # No bsmt
#BsmtUnfSF
X.loc[fil, 'BsmtUnfSF'] = 0 # No bsmt
#TotalBsmtSF
X.loc[fil, 'TotalBsmtSF'] = 0 # No bsmt
#BsmtFullBath
X.loc[fil, 'BsmtFullBath'] = 0 # No bsmt
#BsmtHalfBath
X.loc[fil, 'BsmtHalfBath'] = 0 # No bsmt
#FireplaceQu
X.loc[(X.Fireplaces == 0) & (X.FireplaceQu.isnull()), 'FireplaceQu'] = "NoFire" #missing
#MisGarage
fil = ((X.GarageYrBlt.isnull()) & (X.GarageType.isnull()) & (X.GarageFinish.isnull()) &
(X.GarageQual.isnull()) & (X.GarageCond.isnull()))
fil1 = ((X.GarageYrBlt.notnull()) | (X.GarageType.notnull()) | (X.GarageFinish.notnull()) |
(X.GarageQual.notnull()) | (X.GarageCond.notnull()))
X.loc[fil1, 'MisGarage'] = 0
X.loc[fil, 'MisGarage'] = 1
#GarageYrBlt
X.loc[X.GarageYrBlt > 2200, 'GarageYrBlt'] = 2007 #correct mistake
X.loc[fil, 'GarageYrBlt'] = X['YearBuilt'] # if no garage, use the age of the building
#GarageType
X.loc[fil, 'GarageType'] = "NoGrg" #missing garage
#GarageFinish
X.loc[fil, 'GarageFinish'] = "NoGrg" #missing
#GarageQual
X.loc[fil, 'GarageQual'] = "NoGrg" #missing
#GarageCond
X.loc[fil, 'GarageCond'] = "NoGrg" #missing
#Fence
X.loc[X.Fence.isnull(), 'Fence'] = "NoFence" #missing fence
#Pool
fil = ((X.PoolArea == 0) & (X.PoolQC.isnull()))
X.loc[fil, 'PoolQC'] = 'NoPool'
# not useful features
del X['Id']
del X['MiscFeature'] # we already know it doesn't matter
del X['Condition1']
del X['Condition2']
del X['Exterior1st']
del X['Exterior2nd']
del X['Functional']
del X['Heating']
del X['PoolQC']
del X['RoofMatl']
del X['RoofStyle']
del X['SaleCondition']
del X['SaleType']
del X['Utilities']
del X['BsmtFinType1']
del X['BsmtFinType2']
del X['BsmtFinSF1']
del X['BsmtFinSF2']
del X['Electrical']
del X['Foundation']
del X['Street']
del X['Fence']
del X['LandSlope']
del X['LowQualFinSF']
del X['FireplaceQu']
del X['PoolArea']
del X['MiscVal']
del X['MoSold']
del X['YrSold']
# after model iterations
del X['KitchenAbvGr']
del X['GarageQual']
del X['GarageCond']
return X
The choice was driven by a combination of factors, for example:
Then there is a number of feature we want to create and we are not particularly sure if they will be helpful or not. For example, we have seen that some features can be safely converted into ordinal ones but others do not display an equally clear pattern and might be more useful as dummies. As another example, we created a new feature for the total number of baths and it looks promising but we are not sure if it is worth dropping all the other original features in its favor.
We can stay here all day trying to think about the right combination of transformations we need for the optimal set of features but we are getting better in the use of pipelines, it is time to exploit that.
We can design our transformers in order to have parameters that determine whether or not a feature should be transformed/created. In this way, we can tune those parameters with GridSearch or RandomSearch and obtain the best set for every model.
Note: the code is hidden but rich of insights (or so I like to believe), feel free to expand the next cell
class tr_numeric(BaseEstimator, TransformerMixin):
def __init__(self, SF_room=True, bedroom=True, bath=True, lot=True, service=True):
self.columns = [] # useful to well behave with FeatureUnion
self.SF_room = SF_room
self.bedroom = bedroom
self.bath = bath
self.lot = lot
self.service = service
def fit(self, X, y=None):
return self
def remove_skew(self, X, column):
X[column] = np.log1p(X[column])
return X
def SF_per_room(self, X):
if self.SF_room:
X['sf_per_room'] = X['GrLivArea'] / X['TotRmsAbvGrd']
return X
def bedroom_prop(self, X):
if self.bedroom:
X['bedroom_prop'] = X['BedroomAbvGr'] / X['TotRmsAbvGrd']
del X['BedroomAbvGr'] # the new feature makes it redundant and it is not important
return X
def total_bath(self, X):
if self.bath:
X['total_bath'] = (X[[col for col in X.columns if 'FullBath' in col]].sum(axis=1) +
0.5 * X[[col for col in X.columns if 'HalfBath' in col]].sum(axis=1))
del X['FullBath'] # redundant
del X['HalfBath'] # not useful anyway
del X['BsmtHalfBath']
del X['BsmtFullBath']
return X
def lot_prop(self, X):
if self.lot:
X['lot_prop'] = X['LotArea'] / X['GrLivArea']
return X
def service_area(self, X):
if self.service:
X['service_area'] = X['TotalBsmtSF'] + X['GarageArea']
del X['TotalBsmtSF']
del X['GarageArea']
return X
def transform(self, X, y=None):
for col in ['GrLivArea', '1stFlrSF', 'LotArea']:
X = self.remove_skew(X, col)
X = self.SF_per_room(X)
X = self.bedroom_prop(X)
X = self.total_bath(X)
X = self.lot_prop(X)
X = self.service_area(X)
self.columns = X.columns
return X
def get_feature_names(self):
return self.columns
class drop_columns(BaseEstimator, TransformerMixin):
'''
Drops columns that are not useful for the model
The decisions come from several iterations
'''
def __init__(self, lasso=False, ridge=False, forest=False, xgb=False, lgb=False):
self.columns = []
self.lasso = lasso
self.ridge = ridge
self.forest = forest
self.xgb = xgb
self.lgb = lgb
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
to_drop = [col for col in X.columns if 'NoGrg' in col] # dropping dummies that are redundant
to_drop += [col for col in X.columns if 'NoBsmt' in col]
if self.lasso:
to_drop += [col for col in X.columns if 'BsmtExposure' in col]
to_drop += [col for col in X.columns if 'BsmtCond' in col]
to_drop += [col for col in X.columns if 'ExterCond' in col]
to_drop += [col for col in X.columns if 'HouseStyle' in col]
to_drop += [col for col in X.columns if 'LotShape' in col]
to_drop += [col for col in X.columns if 'LotFrontage' in col]
to_drop += [col for col in X.columns if 'GarageYrBlt' in col]
to_drop += [col for col in X.columns if 'GarageType' in col]
to_drop += ['OpenPorchSF', '3SsnPorch']
if self.ridge:
to_drop += [col for col in X.columns if 'BsmtExposure' in col]
to_drop += [col for col in X.columns if 'BsmtCond' in col]
to_drop += [col for col in X.columns if 'ExterCond' in col]
to_drop += [col for col in X.columns if 'LotFrontage' in col]
to_drop += [col for col in X.columns if 'LotShape' in col]
to_drop += [col for col in X.columns if 'HouseStyle' in col]
to_drop += [col for col in X.columns if 'GarageYrBlt' in col]
to_drop += [col for col in X.columns if 'GarageCars' in col]
to_drop += [col for col in X.columns if 'BldgType' in col]
to_drop += ['OpenPorchSF', '3SsnPorch']
if self.forest:
to_drop += [col for col in X.columns if 'BsmtExposure' in col]
to_drop += [col for col in X.columns if 'BsmtCond' in col]
to_drop += [col for col in X.columns if 'ExterCond' in col]
to_drop += ['OpenPorchSF', '3SsnPorch']
if self.xgb:
to_drop += [col for col in X.columns if 'BsmtExposure' in col]
to_drop += [col for col in X.columns if 'BsmtCond' in col]
to_drop += [col for col in X.columns if 'ExterCond' in col]
if self.lgb:
to_drop += [col for col in X.columns if 'LotFrontage' in col]
to_drop += [col for col in X.columns if 'HouseStyle' in col]
to_drop += ['MisBsm']
for col in to_drop:
try:
del X[col]
except KeyError:
pass
self.columns = X.columns
return X
def get_feature_names(self):
return list(self.columns)
class make_ordinal(BaseEstimator, TransformerMixin):
'''
Transforms ordinal features in order to have them as numeric (preserving the order)
If unsure about converting or not a feature (maybe making dummies is better), make use of
extra_cols and unsure_conversion
'''
def __init__(self, cols, extra_cols=None, include_extra='include'):
self.cols = cols
self.extra_cols = extra_cols
self.mapping = {'Po':1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
self.include_extra = include_extra # either include, dummies, or drop (any other option)
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
if self.extra_cols:
if self.include_extra == 'include':
self.cols += self.extra_cols
elif self.include_extra == 'dummies':
pass
else:
for col in self.extra_cols:
del X[col]
for col in self.cols:
X.loc[:, col] = X[col].map(self.mapping).fillna(0)
return X
class recode_cat(BaseEstimator, TransformerMixin):
'''
Recodes some categorical variables according to the insights gained from the
data exploration phase.
'''
def __init__(self, mean_weight=10, te_neig=True, te_mssc=True):
self.mean_tot = 0
self.mean_weight = mean_weight
self.smooth_neig = {}
self.smooth_mssc = {}
self.te_neig = te_neig
self.te_mssc = te_mssc
def smooth_te(self, data, target, col):
tmp_data = data.copy()
tmp_data['target'] = target
mean_tot = tmp_data['target'].mean()
means = tmp_data.groupby(col)['target'].mean()
counts = tmp_data.groupby(col)['target'].count()
smooth = ((counts * means + self.mean_weight * mean_tot) /
(counts + self.mean_weight))
return mean_tot, smooth
def fit(self, X, y):
if self.te_neig:
self.mean_tot, self.smooth_neig = self.smooth_te(data=X, target=y, col='Neighborhood')
if self.te_mssc:
self.mean_tot, self.smooth_mssc = self.smooth_te(X, y, 'MSSubClass')
return self
def tr_GrgType(self, data):
data['GarageType'] = data['GarageType'].map({'Basment': 'Attchd',
'CarPort': 'Detchd',
'2Types': 'Attchd' }).fillna(data['GarageType'])
return data
def tr_LotShape(self, data):
fil = (data.LotShape != 'Reg')
data['LotShape'] = 1
data.loc[fil, 'LotShape'] = 0
return data
def tr_LandCont(self, data):
fil = (data.LandContour == 'HLS') | (data.LandContour == 'Low')
data['LandContour'] = 0
data.loc[fil, 'LandContour'] = 1
return data
def tr_LandSlope(self, data):
fil = (data.LandSlope != 'Gtl')
data['LandSlope'] = 0
data.loc[fil, 'LandSlope'] = 1
return data
def tr_MSZoning(self, data):
data['MSZoning'] = data['MSZoning'].map({'RH': 'RM', # medium and high density
'C (all)': 'RM', # commercial and medium density
'FV': 'RM'}).fillna(data['MSZoning'])
return data
def tr_Alley(self, data):
fil = (data.Alley != 'NoAlley')
data['Alley'] = 0
data.loc[fil, 'Alley'] = 1
return data
def tr_LotConfig(self, data):
data['LotConfig'] = data['LotConfig'].map({'FR3': 'Corner', # corners have 2 or 3 free sides
'FR2': 'Corner'}).fillna(data['LotConfig'])
return data
def tr_BldgType(self, data):
data['BldgType'] = data['BldgType'].map({'Twnhs' : 'TwnhsE',
'2fmCon': 'Duplex'}).fillna(data['BldgType'])
return data
def tr_MasVnrType(self, data):
data['MasVnrType'] = data['MasVnrType'].map({'BrkCmn': 'BrkFace'}).fillna(data['MasVnrType'])
return data
def tr_HouseStyle(self, data):
data['HouseStyle'] = data['HouseStyle'].map({'1.5Fin': '1.5Unf',
'2.5Fin': '2Story',
'2.5Unf': '2Story',
'SLvl': 'SFoyer'}).fillna(data['HouseStyle'])
return data
def tr_Neighborhood(self, data):
if self.te_neig:
data['Neighborhood'] = data['Neighborhood'].map(self.smooth_neig).fillna(self.mean_tot)
return data
def tr_MSSubClass(self, data):
if self.te_mssc:
data['MSSubClass'] = data['MSSubClass'].map(self.smooth_mssc).fillna(self.mean_tot)
return data
def transform(self, X, y=None):
X = self.tr_GrgType(X)
X = self.tr_LotShape(X)
X = self.tr_LotConfig(X)
X = self.tr_MSZoning(X)
X = self.tr_Alley(X)
X = self.tr_LandCont(X)
X = self.tr_BldgType(X)
X = self.tr_MasVnrType(X)
X = self.tr_HouseStyle(X)
X = self.tr_Neighborhood(X)
X = self.tr_MSSubClass(X)
return X
numeric_pipe = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric())])
cat_pipe = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify())])
processing_pipe = dfp.FeatureUnion_df(transformer_list=[('cat', cat_pipe),
('num', numeric_pipe)])
full_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_pipe),
('scaler', dfp.df_scaler()),
('dropper', drop_columns())])
tmp = train_set.copy()
tmp = full_pipe.fit_transform(tmp, y) # we need to provide y too for the target encoding
tmp.head()
We are now ready to use these parameters in our search or the optimal model.
The key, and for some the novelty, of this phase is to tune both the parameters of the chosen algorithm (like the regularization in Lasso) and the one controlling on the creation of new features or on how to scale the data at the same time.
It is important, as before, to not fall too deep into the rabbit hole we need to be pragmatic. Most of the parameters we want to tune will have such a small influence on the model and such a high cost in terms of computation time that we have to be able to explore the sections of the parameter space that really matter.
To achieve that, it is a good practice to quickly iterate on different configurations and keep track of the effects of our choices. Even if this means coming back to the previous section and create something new. Depending on when you read this notebook, you will see different iterations of this phase (or probably the final iteration). I will try to explain as good as I can why some things were done and some other weren't.
As of now, the full pipeline (without the model) looks like this.
full_pipe.get_params()
Pretty cumbersome but at the bottom of it we have all the hyperparameters we want to tune. Let's tune model by model.
As a reminder, in the model selection phase we have as a baseline for Lasso given by
results.loc[results.model_name=='lasso', ]
Let's then create a pipeline for this model
lasso_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_pipe),
('scaler', dfp.df_scaler()),
('dropper', drop_columns()),
('lasso', Lasso(alpha=0.01))])
The parameter space I want to explore, searching for the best configuration, is
lasso_params = {'lasso__alpha': [0.05, 0.01, 0.005, 0.001],
'lasso__tol': [0.005, 0.001, 0.0005, 0.0001],
'proc__cat__dummies__drop_first': [True, False],
'proc__cat__recode__te_mssc': [True, False],
'proc__cat__recode__te_neig': [True, False],
'proc__cat__ord__include_extra': ['include', 'dummies'],
'proc__num__transf__SF_room': [True, False],
'proc__num__transf__bath': [True, False],
'proc__num__transf__bedroom': [True, False],
'proc__num__transf__lot': [True, False],
'proc__num__transf__service': [True, False],
'scaler__method': ['standard', 'robust'],
'dropper__drop': [True, False]}
These are 32768 different configurations to test on 5 folds. Even though Lasso is pretty quick and the data are not big, it can take quite a long time.
After a few iterations, and this is the last time I put all the steps of the process or this notebook becomes unbearable, we can restrict our search a bit. Namely,
These are still a lot of configurations. We will thus test only 200 random combinations
lasso_params = {'lasso__alpha': [0.002, 0.001, 0.0009, 0.0008],
'lasso__tol': [0.005, 0.001, 0.0005, 0.0001],
'proc__cat__dummies__drop_first': [True, False],
'proc__cat__ord__include_extra': ['include', 'dummies'],
'proc__num__transf__SF_room': [True, False],
'proc__num__transf__bedroom': [True, False],
'proc__num__transf__lot': [True, False],
'scaler__method': ['standard', 'robust']}
result_lasso, bp_lasso, best_lasso = grid_search(train_set, y, lasso_pipe,
param_grid=lasso_params, cv=folds, scoring='neg_mean_squared_error',
random=100)
# todo: apparently dropping the extra columns to transform to ordinal gives problems. It has to be fixed
result_lasso.head(10)
After this search, the best configuration is given by
bp_lasso
The model has now these coefficients
get_coef(best_lasso)
While the performance on our test set (which should not influence our hyperparameter tuning but it is nice to keep track of its evolution) is
tmp = test_set.copy()
preds = best_lasso.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
To come back to the spirit of the Kaggle competition, this result is extremely close to the one that you obtain in public leaderboard if you simply predict on the provided test set. This would approximatevely score in the top 34% and, if we retrain on the full test, top 31%. This is not very competitive but the good news is that we can focus on increase this cross validated result to also obtain a better score in the competition.
After a few iterations, we get rid of the features that are not helpful and with the final pipeline for Lasso
numeric_lasso = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric(lot=False,
bedroom=False,
SF_room=False))])
cat_lasso = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify(drop_first=True))])
processing_lasso = dfp.FeatureUnion_df(transformer_list=[('cat', cat_lasso),
('num', numeric_lasso)])
lasso_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_lasso),
('scaler', dfp.df_scaler(method='standard')),
('dropper', drop_columns(lasso=True)),
('lasso', Lasso(alpha=0.001, tol=0.005))])
With coefficients
lasso_oof, coefs = cv_score(train_set, y, folds, lasso_pipe, imp_coef=True)
coefs
And the following in fold and out of fold predictions
print(f'Train set RMSE: {round(np.sqrt(mean_squared_error(y, lasso_oof)), 4)}')
print(f'Train set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(lasso_oof)), 2)}')
plot_predictions(train_set, y, lasso_oof)
tmp = train_set.copy()
lasso_pipe.fit(tmp, y)
tmp = test_set.copy()
preds = lasso_pipe.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
Our effort of tuning Lasso a bit more, considering the creation of features and their selection as part of the tuning, led to an improvement of the model performance from the original 0.123 to the current 0.117. In terms of mean absolute error, we go from a little bit more than 15000 dollars to a bit less than 15000. In a real project, it is a good idea to keep track of the time and effort we put in fine tuning our model and evaluate if the project gets a significant benefit from this time and effort.
In this case, the tuning took us not more than 2 hours and a model that is about 1000 dollars more accurate is worth 2 hours of work.
To be fair, we should not compare the results obtained on the training set. Even though we are using our pipeline and k-fold cross-validation, some information is leaking from the training folds to the test folds in the form of our decisions iteration after iteration. In other words, each iteration we tweak the model to get a better score and this is a good strategy to tune it. However, we are, in a sense, overfitting the test folds.
Therefore a proper indication of what our model is doing can only come from the test set. It is very important that we use this set only to monitor the performance and not to take further modeling decisions or we risk of overfitting that set too.
On the test set, the model improved of 0.009 in RMSE (7% better) and of about 2000 dollars (12% better).
The strategy stays the same
results.loc[results.model_name=='ridge', ]
ridge_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_pipe),
('scaler', dfp.df_scaler()),
('dropper', drop_columns()),
('ridge', Ridge())])
ridge_params = {'ridge__alpha': [2, 1.7, 1.5, 1.3, 1, 0.9],
'ridge__tol': [0.005, 0.001, 0.0005, 0.0001],
'proc__cat__dummies__drop_first': [True, False],
'proc__cat__ord__include_extra': ['include', 'dummies'],
'proc__num__transf__SF_room': [True, False],
'proc__num__transf__lot': [True, False],
'scaler__method': ['standard', 'robust']}
result_ridge, bp_ridge, best_ridge = grid_search(train_set, y, ridge_pipe,
param_grid=ridge_params, cv=folds, scoring='neg_mean_squared_error',
random=100)
result_ridge.head(10)
bp_ridge
get_coef(best_ridge)
tmp = test_set.copy()
preds = best_ridge.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
Once again, we use these information to iterate a few more times on the various configurations. Here you can see, for example, that we drop a different set of features. The decision was driven by the study of the coefficients in cross-validation. If a coefficient was varying a lot from fold to fold, we performed an iteration without that coefficient and checked if the model was improving.
numeric_ridge = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric(SF_room=False))])
cat_ridge = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify(drop_first=True))])
processing_ridge = dfp.FeatureUnion_df(transformer_list=[('cat', cat_ridge),
('num', numeric_ridge)])
ridge_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_ridge),
('scaler', dfp.df_scaler(method='robust')),
('dropper', drop_columns(ridge=True)),
('ridge', Ridge(alpha=2, tol=0.0001))])
The model has then these coefficients
ridge_oof, coefs = cv_score(train_set, y, folds, ridge_pipe, imp_coef=True)
coefs
print(f'Train set RMSE: {round(np.sqrt(mean_squared_error(y, ridge_oof)), 4)}')
print(f'Train set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(ridge_oof)), 2)}')
plot_predictions(train_set, y, ridge_oof)
tmp = train_set.copy()
ridge_pipe.fit(tmp, y)
tmp = test_set.copy()
preds = ridge_pipe.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
This time our 2 hours of work (I am considering all the iterations made up until this moment) led to a 3% improvement in RMSE and to a model about 1000 dollars more accurate.
The scaler should not matter as in general scaling the data for a tree-based algorithm will only increase the speed and not the outcome.
results.loc[results.model_name=='forest', ]
forest_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_pipe),
('scaler', dfp.df_scaler()),
('dropper', drop_columns()),
('forest', RandomForestRegressor(n_estimators=300, n_jobs=-1, random_state=32))])
forest_params = {'forest__max_depth': [10, 20, 30, None],
'forest__max_features': ['auto', 'sqrt', 'log2'],
'forest__min_samples_leaf': [1, 3, 5, 10],
'forest__min_samples_split': [2, 4, 6, 8],
'proc__cat__dummies__drop_first': [True, False],
'proc__cat__ord__include_extra': ['include', 'dummies'],
'proc__num__transf__SF_room': [True, False],
'proc__num__transf__bath': [True, False],
'proc__num__transf__bedroom': [True, False],
'proc__num__transf__lot': [True, False],
'proc__num__transf__service': [True, False]}
result_forest, bp_forest, best_forest = grid_search(train_set, y, forest_pipe,
param_grid=forest_params, cv=folds, scoring='neg_mean_squared_error',
random=100)
result_forest.head(10)
bp_forest
get_feature_importance(best_forest)
tmp = test_set.copy()
preds = best_forest.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
While the model is not performing very well, it is also very easy to tune (or, to say it better, tuning it doesn't improve it that much). The final version of the RandomForest is then
numeric_forest = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric(SF_room=False,
bedroom=False,
lot=False))])
cat_forest = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify(drop_first=True))])
processing_forest = dfp.FeatureUnion_df(transformer_list=[('cat', cat_forest),
('num', numeric_forest)])
forest_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_forest),
('scaler', dfp.df_scaler(method='robust')),
('dropper', drop_columns(forest=True)),
('forest', RandomForestRegressor(n_estimators=1500, max_depth=30,
max_features='sqrt',
n_jobs=-1, random_state=32))])
forest_oof, coefs = cv_score(train_set, y, folds, forest_pipe, imp_coef=True)
coefs
print(f'Train set RMSE: {round(np.sqrt(mean_squared_error(y, forest_oof)), 4)}')
print(f'Train set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(forest_oof)), 2)}')
plot_predictions(train_set, y, forest_oof)
tmp = train_set.copy()
forest_pipe.fit(tmp, y)
tmp = test_set.copy()
preds = forest_pipe.predict(tmp)
print(f'Test set RMSE: {round(np.sqrt(mean_squared_error(y_test, preds)), 4)}')
print(f'Test set MAE: {round(mean_absolute_error(np.expm1(y_test), np.expm1(preds)), 2)}')
plot_predictions(test_set, y_test, preds)
Once again, the iterations led to a 6% improvement for RMSE and to a model 1500 dollars more accurate. While the result is worse than the previous 2 models, the tuning took also very little time (almost completely run time).
The strategy here will be a little different as we want to make use of early stopping and tune the other hyperparameters. For some reasons, running the grid_search function above here on Kaggle, is resulting in an endless run. Luckily it is not the case locally and we can tune some of the hyperparameters following this method.
However, the most important hyperparameters to tune are arguably the learning rate and the number of estimators. We thus implement a simple method to make use of the early stopping rounds.
results[results.model_name == 'xgb']
def xgb_train(train, target, pipe, kfolds):
oof = np.zeros(len(train))
pd.options.mode.chained_assignment = None
feat_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(kfolds.split(train.values, target.values)):
print("fold n°{}".format(fold_))
trn_data = train.iloc[trn_idx].copy()
val_data = train.iloc[val_idx].copy()
trn_target = target.iloc[trn_idx]
val_target = target.iloc[val_idx]
trn_data = pipe.fit_transform(trn_data, trn_target)
val_data = pipe.transform(val_data)
clf = xgb.XGBRegressor(n_estimators=10000, objective='reg:squarederror',
max_depth=3, colsample_bytree=0.5, subsample=0.5,
reg_alpha=0.4, reg_lambda=0.6,
learning_rate=0.01, n_jobs=-1, random_state=31).fit(
trn_data, trn_target,
eval_set=[(val_data, val_target)],
eval_metric='rmse', early_stopping_rounds=200, verbose=2000)
oof[val_idx] = clf.predict(pipe.transform(train.iloc[val_idx]),
ntree_limit=clf.best_iteration)
fold_df = pd.DataFrame()
fold_df["feat"] = trn_data.columns
fold_df["score"] = clf.feature_importances_
fold_df['fold'] = fold_ + 1
feat_df = pd.concat([feat_df, fold_df], axis=0)
feat_df = feat_df.groupby('feat')['score'].agg(['mean', 'std'])
feat_df['abs_sco'] = (abs(feat_df['mean']))
feat_df = feat_df.sort_values(by=['abs_sco'],ascending=False)
del feat_df['abs_sco']
print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))
pd.options.mode.chained_assignment = 'warn'
return oof, feat_df
numeric_xgb = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric(bedroom=False,
lot=False))])
cat_xgb = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify(drop_first=True))])
processing_xgb = dfp.FeatureUnion_df(transformer_list=[('cat', cat_xgb),
('num', numeric_xgb)])
xgb_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_xgb),
('scaler', dfp.df_scaler(method='robust')),
('dropper', drop_columns(xgb=True))])
tmp = train_set.copy()
oof_xgb, f_i = xgb_train(tmp, y, xgb_pipe, folds)
f_i
print(f'Train set RMSE: {round(np.sqrt(mean_squared_error(y, oof_xgb)), 4)}')
print(f'Train set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(oof_xgb)), 2)}')
plot_predictions(train_set, y, oof_xgb)
Very similarly to the previous case, we tune some hyperparameters locally and display only the latest phase
results[results.model_name == 'lgb']
def lgb_train(train, target, pipe, kfolds):
oof = np.zeros(len(train))
pd.options.mode.chained_assignment = None
feat_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(kfolds.split(train.values, target.values)):
print("fold n°{}".format(fold_))
trn_data = train.iloc[trn_idx].copy()
val_data = train.iloc[val_idx].copy()
trn_target = target.iloc[trn_idx]
val_target = target.iloc[val_idx]
trn_data = pipe.fit_transform(trn_data, trn_target)
val_data = pipe.transform(val_data)
clf = lgb.LGBMRegressor(n_estimators=20000, learning_rate=0.01,
num_leaves=4, max_depth=3,
subsample=0.8, colsample_bytree=0.5).fit(
trn_data, trn_target,
eval_set=[(val_data, val_target)],
eval_metric='rmse', early_stopping_rounds=200, verbose=1000)
oof[val_idx] = clf.predict(pipe.transform(train.iloc[val_idx]),
ntree_limit=clf.best_iteration_)
fold_df = pd.DataFrame()
fold_df["feat"] = trn_data.columns
fold_df["score"] = clf.feature_importances_
fold_df['fold'] = fold_ + 1
feat_df = pd.concat([feat_df, fold_df], axis=0)
feat_df = feat_df.groupby('feat')['score'].agg(['mean', 'std'])
feat_df['abs_sco'] = (abs(feat_df['mean']))
feat_df = feat_df.sort_values(by=['abs_sco'],ascending=False)
del feat_df['abs_sco']
print("CV score: {:<8.5f}".format(mean_squared_error(oof, target)**0.5))
pd.options.mode.chained_assignment = 'warn'
return oof, feat_df
numeric_lgb = Pipeline([('fs', dfp.feat_sel('numeric')),
('imp', dfp.df_imputer(strategy='median')),
('transf', tr_numeric(bedroom=False,
SF_room=False,
lot=False))])
cat_lgb = Pipeline([('fs', dfp.feat_sel('category')),
('imp', dfp.df_imputer(strategy='most_frequent')),
('ord', make_ordinal(['BsmtQual', 'KitchenQual', 'ExterQual', 'HeatingQC'],
extra_cols=['BsmtExposure', 'BsmtCond', 'ExterCond'],
include_extra='include')),
('recode', recode_cat()),
('dummies', dfp.dummify(drop_first=True))])
processing_lgb = dfp.FeatureUnion_df(transformer_list=[('cat', cat_lgb),
('num', numeric_lgb)])
lgb_pipe = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_lgb),
('scaler', dfp.df_scaler(method='robust')),
('dropper', drop_columns(lgb=True))])
tmp = train_set.copy()
oof_lgb, f_i = lgb_train(tmp, y, lgb_pipe, folds)
f_i
print(f'Train set RMSE: {round(np.sqrt(mean_squared_error(y, oof_lgb)), 4)}')
print(f'Train set MAE: {round(mean_absolute_error(np.expm1(y), np.expm1(oof_lgb)), 2)}')
plot_predictions(train_set, y, oof_lgb)
We got our first improvements, our models are already better than, for example, a simple regression using the size of the house in combination with the neighborhood.
It is a good moment to stop and think about what is our model saying and what is missing.
This section is made out of intuition, I am genuinely unaware if what follows makes sense mathematically. Any feedback on this part will be particularly appreciated
First, let's put our prediction together with the data that led to them.
err_an = train_set.copy()
err_an['lasso_oof'] = lasso_oof
err_an['ridge_oof'] = ridge_oof
err_an['forest_oof'] = forest_oof
err_an['xgb_oof'] = oof_xgb
err_an['lgb_oof'] = oof_lgb
err_an['target'] = y
err_an.head()
We can see how all the predictions are very correlated to one another (and with the target, this is good)
exp.plot_correlations(err_an[[col for col in err_an.columns if '_oof' in col]+['target']],
target='target', annot=True)
If we compute the residuals and plot them, the pattern looks even more evident.
oof_cols = [col for col in err_an.columns if '_oof' in col]
for col in oof_cols:
name = col.replace('_oof', '_res')
err_an[name] = err_an['target'] - err_an[col]
exp.corr_target(err_an, 'target',
[col for col in err_an.columns if '_oof' in col]+
[col for col in err_an.columns if '_res' in col])
Looking at the residual plots, it appears evident that all the models we trained so far are underestimating the price of low costs houses and overestimating the more expensive ones. This could be because we used some target encoding or simply that we are overestimating, for example, the importance of the house size.
Since the predictions do not change much from model to model, we can simply focus on one of them. For example, let's focus on LightGBM.
We can try to see if there are interesting relations between the residuals and the original features.
exp.plot_correlations(err_an, target='lgb_res')
Which shows little to nothing. However, for categorical features, we can start focusing on the feature that was used both to stratify our folds (and test set) and then to be target encoded: Neighborhood.
err = exp.segm_target(err_an, 'Neighborhood', 'lgb_res')
tar = exp.segm_target(err_an, 'Neighborhood', 'target')
tot = pd.merge(err.reset_index(), tar.reset_index(), on='Neighborhood', suffixes=('_res', '_target'))
del tot['count_target']
tot
tot.corr()
A few noticeable things are:
This makes me consider if it would be a good idea to not use the target encoding variables and see if that pattern in the error disappears.
Another possible test is to see if some variables we did not include, for example
exp.segm_target(err_an, 'Exterior1st', 'lgb_res')
This shows how the MetalSd exterior leads to a particularly different pattern in the distribution of the error. A direct inspection of these houses shows the following
err_an[err_an.Exterior1st == 'MetalSd'].describe() - err_an[err_an.Exterior1st != 'MetalSd'].describe()
In other words, houses with that particular exterior
We could then consider to include this feature as well and see how the model reacts.
Another approach would be to explore the entries with the biggest errors. For example
def high_low_errors(data, *, res_list=None, n_samples=50,
target=None, pred_list=None, mean=False,
abs_err=True, common=False):
df = data.copy()
if pred_list:
res_list = []
for col in pred_list:
name = col + '_res'
res_list.append(name)
df[name] = df[target] - df[col]
errors = {}
if mean:
df['mean_res'] = df[res_list].mean(axis=1)
res_list += ['mean_res']
for col in res_list:
if abs_err:
if col == 'abs_err':
name = 'abs_err'
else:
name = 'abs_' + col
df[name] = abs(df[col])
else:
name = col
high_err = df.sort_values(name, ascending=False).head(n_samples)
low_err = df.sort_values(name, ascending=False).tail(n_samples)
try:
errors[name] = high_err.describe(include='all').drop(index=['top', 'count', 'freq']).fillna(0) - \
low_err.describe(include='all').drop(index=['top', 'count', 'freq']).fillna(0)
except KeyError:
errors[name] = high_err.describe().fillna(0) - low_err.describe().fillna(0)
return errors
h_v_l = high_low_errors(err_an, res_list=[col for col in err_an.columns if '_res' in col], mean=True)
h_v_l['abs_mean_res']
We may have to explain to the stakeholders of the project what is our model doing to make its prediction. This is a very complicated question when we step away from the simplest models (that, to be honest, are performing well enough if we compare them to the more complex ones here presented). However, some simple things might just do the job.
Let's focus on our RandomForest, we know already how off the predictions are and, if we don't want to scroll up this notebook, we can see what features are the most important.
imps = get_feature_importance(forest_pipe)
imps.head(9)
features = imps.head(9).feat.values
# we need the data to be tranformed, so I break the pipe in 2 parts
proc = Pipeline([('gen_cl', general_cleaner()),
('proc', processing_forest),
('scaler', dfp.df_scaler(method='robust')),
('dropper', drop_columns(forest=True))])
tmp = proc.fit_transform(train_set.copy(), y)
ls_tm = RandomForestRegressor(n_estimators=1500, max_depth=30,
max_features='sqrt',
n_jobs=-1, random_state=32)
ls_tm.fit(tmp, y)
fig, ax = plt.subplots(3,3, figsize=(15,10))
plot_partial_dependence(ls_tm, tmp, features, ax=ax,
n_jobs=-1, grid_resolution=50)
fig.subplots_adjust(hspace=0.3)
We see that around the mean (the 0 in this graphs), there is generally a big step up in price predicted.
Or, we can combine 2 features to get different insights
features = [('OverallQual', 'service_area'), ('OverallQual', 'GrLivArea'),
('OverallQual', 'Neighborhood'), ('Neighborhood', 'service_area')]
fig, ax = plt.subplots(2,2, figsize=(12,12))
plot_partial_dependence(ls_tm, tmp, features, ax=ax,
n_jobs=-1, grid_resolution=20)
fig.subplots_adjust(hspace=0.3)
... to be continued ....